Abstract

Background: An important question in genome evolution is whether there exist fragile regions (rearrangement 
hotspots) where chromosomal rearrangements are happening over and over again. Although nearly all recent 
studies supported the existence of fragile regions in mammalian genomes, the most comprehensive phylogenomic 
study of mammals raised some doubts about their existence. 

Results: Here we demonstrate that fragile regions are subject to a birth and death process, implying that fragility 
has a limited evolutionary lifespan. 

Conclusions: This finding implies that fragile regions migrate to different locations in different mammals, 
explaining why there exist only a few chromosomal breakpoints shared between different lineages. The birth and 
death of fragile regions as a phenomenon reinforces the hypothesis that rearrangements are promoted by 
matching segmental duplications and suggests putative locations of the currently active fragile regions in the 
human genome. 



Background 

In 1970 Susumu Ohno [1] came up with the Random 
Breakage Model (RBM) of chromosome evolution, 
implying that there are no rearrangement hotspots in 
mammalian genomes. In 1984 Nadeau and Taylor [2] 
laid the statistical foundations of RBM and demon- 
strated that it was consistent with the human and 
mouse chromosomal architectures. In the next two dec- 
ades, numerous studies with progressively increasing 
resolution made RBM the de facto theory of chromo- 
some evolution. 

RBM was refuted by Pevzner and Tesler [3] who sug- 
gested the Fragile Breakage Model (FBM) postulating 
that mammalian genomes are mosaics of fragile and 
solid regions. In contrast to RBM, FBM postulates that 
rearrangements are mainly happening in fragile regions 
forming only a small portion of the mammalian gen- 
omes. While the rebuttal of RBM caused a controversy 
[4-6], Peng et al. [7] and Alekseyev and Pevzner [8] 
revealed some flaws in the arguments against FBM. 
Furthermore, the rebuttal of RBM was followed by
many studies supporting FBM [9-31].

Comparative analysis of the human chromosomes 
reveals many short adjacent regions corresponding to 
parts of several mouse chromosomes [32]. While such a 
surprising arrangement of synteny blocks points to 
potential rearrangement hotspots, it remains unclear 
whether these regions reflect genome rearrangements or 
duplications/assembly errors/alignment artifacts. Early 
studies of genomic architectures were unable to distin- 
guish short synteny blocks from artifacts and thus were 
limited to constructing large synteny blocks. Ma et al. 
[33] addressed the challenge of constructing high-reso- 
lution synteny blocks via the analysis of multiple gen- 
omes. Remarkably, their analysis suggests that there is 
limited breakpoint reuse, an argument against FBM, that 
led to a split among researchers studying chromosome 
evolution and raised a challenge of reconciling these 
contradictory results. Ma et al. [33] wrote: 'a careful 
analysis [of the RBM vs FBM controversy] is beyond the 
scope of this study' leaving the question of interpreting 
their findings open. Various models of chromosome evolution imply various statistics and thus
can be verified by various tests. For example, RBM implies an exponential distribution of the
synteny block sizes, consistent with the human-mouse synteny blocks observed in [2]. Pevzner
and Tesler [3] introduced the 'pairwise breakpoint reuse' test and demonstrated that while RBM
implies low breakpoint reuse, the human-mouse synteny blocks expose rampant breakpoint reuse.
Thus RBM is consistent with the 'exponential length distribution' test [2] but inconsistent
with the 'pairwise breakpoint reuse' test [34]. Both these tests are applied to pairs of
genomes and do not take advantage of the multiple genomes that were recently sequenced. Below
we introduce the 'multispecies breakpoint reuse' test and demonstrate that neither RBM nor FBM
passes this test. We further propose the Turnover Fragile Breakage Model (TFBM) that extends
FBM and complies with the multispecies breakpoint reuse test.

© 2010 Alekseyev et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.

Alekseyev and Pevzner Genome Biology 2010, 11:R117
http://genomebiology.com/2010/11/11/R117

Technically, the findings in [33] (limited breakpoint reuse between different lineages) are
not in conflict with the findings in [3] (rampant breakpoint reuse in chromosome evolution).
Indeed, Ma et al. [33] only considered reuse between different branches of the phylogenetic
tree (inter-reuse) and did not analyze reuse within individual branches (intra-reuse) of the
tree. TFBM reconciles the recent studies supporting FBM with the Ma et al. [33] analysis. We
demonstrate that the data in [33] reveal rampant but elusive breakpoint reuse that cannot be
detected by counting repeated breakages between various pairs of branches of the evolutionary
tree. TFBM is an extension of FBM that reconciles the seemingly contradictory results in
[9-31] and [33] and explains why they do not contradict each other. TFBM postulates that
fragile regions have a limited lifespan and implies that they can migrate between different
genomic locations. An intriguing implication of TFBM is that few regions in a genome are
fragile at any given time, raising the question of finding the currently active fragile
regions in the human genome.

While many authors have discussed the causes of fragility, the question of what makes certain
regions fragile remains open. Previous studies attributed fragile regions to segmental
duplications [35-38], high repeat density [39], high recombination rate [40], pairs of tRNA
genes [41,42], inhomogeneity of gene distribution [7], and long regulatory regions [7,17,26].
Since we observed the birth and death of fragile regions, we are particularly interested in
features that are also subject to a birth and death process. Recently, Zhao and Bourque [38]
provided a new insight into the association of rearrangements with segmental duplications by
demonstrating that many rearrangements are flanked by Matching Segmental Duplications (MSDs),
that is, a pair of long similar regions located within a pair of breakpoint regions
corresponding to a rearrangement event. MSDs arguably represent an ideal match for TFBM among
the features that were previously implicated in breakpoint reuses. TFBM is consistent with the
hypothesis that MSDs promote fragility since the similarity between MSDs deteriorates with
time, implying that MSDs are also subject to a 'birth and death' process.

Results and Discussion 

Rearrangements and breakpoint graphs 

For the sake of simplicity, we start our analysis with circular genomes consisting of circular
chromosomes. While we use circular chromosomes to simplify the computational concepts
discussed in the paper, all analysis is done with real (linear) mammalian chromosomes (see
Alekseyev [43] for subtle differences between circular and linear chromosome analysis). We
represent a circular chromosome with synteny blocks $x_1, \ldots, x_n$ as a cycle (Figure 1a)
composed of n directed labeled edges (corresponding to the blocks) and n undirected unlabeled
edges (connecting adjacent blocks). The directions of the edges correspond to signs (strands)
of the blocks. We label the tail and head of a directed edge $x_i$ as $x_i^t$ and $x_i^h$,
respectively. We represent a genome as a genome graph consisting of disjoint cycles (one for
each chromosome). The edges in each cycle alternate between two colors: one color reserved for
undirected edges and the other color (traditionally called 'obverse') reserved for directed
edges.

Let P be a genome represented as a collection of alternating black-obverse cycles (a cycle is
alternating if the colors of its edges alternate). For any two black edges (u, v) and (x, y)
in the genome (graph) P, we define a 2-break rearrangement (see [44]) as the replacement of
these edges with either a pair of edges (u, x), (v, y), or a pair of edges (u, y), (v, x)
(Figure 2). 2-breaks extend the standard operations of reversals (Figure 2a), fissions
(Figure 2b), or fusions/translocations (Figure 2c) to the case of circular chromosomes. We say
that a 2-break on edges (u, v), (x, y) uses vertices u, v, x, and y.

Let P and Q be 'black' and 'red' genomes on the same set of synteny blocks X. The breakpoint
graph G(P, Q) is defined on the set of vertices $V = \{x^t, x^h \mid x \in X\}$ with black and
red edges inherited from genomes P and Q (Figure 1b). The black and red edges form a
collection of alternating black-red cycles in G(P, Q) and play an important role in analyzing
rearrangements (see [45] for background information on genome rearrangements). The trivial
cycles in G(P, Q), formed by pairs of parallel black and red edges, represent common
adjacencies between synteny blocks in genomes P and Q. Vertices of the non-trivial cycles in
G(P, Q) represent breakpoints that partition genomes P and Q into (P, Q)-synteny blocks
(Figure 1c). The 2-break distance d(P, Q) between circular genomes P and Q is defined as the
minimum number of 2-breaks required to transform
one genome into the other (Figure 1d).

Figure 1 An example of the breakpoint graph and its transformation into an identity breakpoint
graph. (a) Graph representation of a two-chromosomal genome P = (+a+b)(+c+e+d) as two
black-obverse cycles and a unichromosomal genome Q = (+a+b-e+c-d) as a red-obverse cycle.
(b) The superposition of the genome graphs P and Q. (c) The breakpoint graph G(P, Q) of the
genomes P and Q (with removed obverse edges). The black and red edges in G(P, Q) form
c(P, Q) = 2 non-trivial black-red cycles and one trivial black-red cycle. The trivial cycle
$(a^h, b^t)$ corresponds to a common adjacency between the genes a and b in the genomes P and
Q. The vertices in the non-trivial cycles represent breakpoints corresponding to the endpoints
of b(P, Q) = 4 synteny blocks: ab, c, d, and e. By Theorem 1, the distance between the genomes
P and Q is d(P, Q) = 4 - 2 = 2. (d) A transformation of the breakpoint graph G(P, Q) into the
identity breakpoint graph G(Q, Q), corresponding to a transformation of the genome P into the
genome Q with two 2-breaks. The first 2-break transforms P into a genome P' = (+a+b)(+c-d-e),
while the second 2-break transforms P' into Q. Each 2-break increases the number of black-red
cycles in the breakpoint graph by one, implying this transformation is shortest (see
Theorem 1).

Figure 2 A 2-break on edges (u, v) and (x, y) corresponding to (a) reversal, (b) fission,
(c) translocation/fusion.

In contrast to the genomic distance [46] (for linear genomes), the 2-break distance for
circular genomes is easy to compute [47]:

Theorem 1 The 2-break distance between circular genomes P and Q is d(P, Q) = b(P, Q) - c(P, Q),
where b(P, Q) and c(P, Q) are respectively the number of (P, Q)-synteny blocks and non-trivial
black-red cycles in G(P, Q).
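
Theorem 1 can be checked on a small example. The sketch below is our own illustration, not code from the paper: it builds the adjacency (black or red) edges of circular genomes and counts synteny blocks and non-trivial cycles. The data layout (lists of `(block, sign)` pairs) and the function names are assumptions made for this example, and the genomes P and Q are those of Figure 1 as we read them from its caption.

```python
def adjacencies(genome):
    """Return the undirected (adjacency) edges of a circular genome.

    A genome is a list of circular chromosomes; a chromosome is a list of
    (block, sign) pairs with sign +1 or -1. Vertices are (block, 't') and
    (block, 'h'), the tail and head of the directed block edge.
    """
    partner = {}
    for chrom in genome:
        for (x, sx), (y, sy) in zip(chrom, chrom[1:] + chrom[:1]):
            leave = (x, 'h' if sx > 0 else 't')  # end at which we exit block x
            enter = (y, 't' if sy > 0 else 'h')  # end at which we enter block y
            partner[leave], partner[enter] = enter, leave
    return partner


def two_break_distance(p, q):
    """Return (b, c, d): synteny blocks, non-trivial cycles, and d = b - c."""
    black, red = adjacencies(p), adjacencies(q)
    seen, cycles, vertices = set(), 0, 0
    for start in black:
        if start in seen:
            continue
        length, v = 0, start
        while v not in seen:          # walk one alternating black-red cycle
            w = black[v]
            seen.update((v, w))
            length += 2
            v = red[w]
        if length > 2:                # cycles of length 2 are trivial
            cycles += 1
            vertices += length
    b = vertices // 2                 # there are 2*b breakpoint vertices
    return b, cycles, b - cycles


# Genomes of Figure 1 (assumed reading of the caption).
P = [[('a', 1), ('b', 1)], [('c', 1), ('e', 1), ('d', 1)]]
Q = [[('a', 1), ('b', 1), ('e', -1), ('c', 1), ('d', -1)]]
print(two_break_distance(P, Q))  # (4, 2, 2)
```

The result agrees with the values b(P, Q) = 4, c(P, Q) = 2, and d(P, Q) = 2 stated in the caption of Figure 1.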

Inter- and intra-breakpoint reuse 

Figure 3 shows a phylogenetic tree with specified rearrangements on its branches (we write
$\rho \in e$ to refer to a 2-break $\rho$ on an edge e). We represent each genome as a genome
graph (that is, a collection of cycles) on the same set V of 2n vertices (corresponding to the
endpoints of the synteny blocks). Given a set of genomes and a phylogenetic tree describing
rearrangements between these genomes, we define the notions of inter- and intra-breakpoint
reuses. A vertex $v \in V$ is inter-reused on two distinct branches $e_1$ and $e_2$ of a
phylogenetic tree if there exist 2-breaks $\rho_1 \in e_1$ and $\rho_2 \in e_2$ that both use
v. Similarly, a vertex $v \in V$ is intra-reused on a branch e if there exist two distinct
2-breaks $\rho_1, \rho_2 \in e$ that both use v. For example, a vertex $c^h$ is inter-reused on
the branches $(Q_3, P_1)$ and $(Q_2, P_3)$, while a vertex $f^t$ is intra-reused on the branch
$(Q_3, Q_2)$ of the tree in Figure 3. We define $br(e_1, e_2)$ as the number of vertices
inter-reused on the branches $e_1$ and $e_2$, and br(e) as the number of vertices intra-reused
on the branch e. An alternative approach to measuring breakpoint intra-reuse is to define the
weighted intra-reuse of a vertex v on a branch e as max{0, use(e, v) - 1}, where use(e, v) is
the number of 2-breaks on e using v. The weighted intra-reuse BR(e) on the branch e is the sum
of the weighted intra-reuse of all vertices. We remark that if no vertex is used more than
twice on a branch e then BR(e) = br(e).
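
The definitions above translate directly into counting. The following is a minimal sketch, assuming that each 2-break is given as a tuple of the four vertices it uses; the vertex labels and the two example branches are hypothetical and are not taken from Figure 3.

```python
from collections import Counter


def intra_reuse(two_breaks):
    """Return (br, BR) for one branch.

    br counts vertices used by at least two 2-breaks on the branch;
    BR sums max(0, use - 1) over all vertices (weighted intra-reuse).
    """
    use = Counter(v for rho in two_breaks for v in rho)
    br = sum(1 for count in use.values() if count >= 2)
    weighted = sum(max(0, count - 1) for count in use.values())
    return br, weighted


def inter_reuse(branch1, branch2):
    """Number of vertices used on both branches."""
    used1 = {v for rho in branch1 for v in rho}
    used2 = {v for rho in branch2 for v in rho}
    return len(used1 & used2)


# Hypothetical branches, for illustration only.
e1 = [("a_h", "b_t", "c_h", "a_t"),
      ("c_h", "d_t", "e_h", "f_t"),
      ("c_h", "e_h", "a_h", "f_h")]
e2 = [("c_h", "d_t", "f_t", "a_t")]

print(intra_reuse(e1))      # (3, 4)
print(inter_reuse(e1, e2))  # 4
```

Here the vertex `c_h` is used three times on `e1`, so it contributes one to br but two to BR, which is why the two measures differ.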

Given simulated data, one can compute br(e) for all branches and $br(e_1, e_2)$ for all pairs
of branches in the phylogenetic tree. However, for real data, rearrangements along the
branches are unknown, calling for alternative ways of estimating the inter- and intra-reuse.

Cycles in the breakpoint graphs provide yet another way to estimate the inter- and intra-reuse.
For a branch e = (P, Q) of the phylogenetic tree, one can estimate br(e) by comparing the
2-break distance d(P, Q) and the number of breakpoints 2·b(P, Q) between the genomes P and Q.
This results in the lower bound bound(e) = 4·d(P, Q) - 2·b(P, Q) for BR(e) [34] that also gives
a good approximation for br(e). On the other hand, one can estimate $br(e_1, e_2)$ as the
number $bound(e_1, e_2)$ of vertices shared between non-trivial cycles in the breakpoint graphs
corresponding to the branches $e_1$ and $e_2$ (a similar approach was used in [48] and later
explored in [12,33]). Assuming that the genomes at the internal nodes of the phylogenetic tree
can be reliably reconstructed [33,49-51], one can compute bound(e) and $bound(e_1, e_2)$ for
all (pairs of) branches. Below we show that these bounds accurately approximate the intra- and
inter-reuse.
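
As a worked illustration of the intra-reuse bound (our own calculation, not a result reported in the paper), consider the branch $(Q_3, Q_2)$ with the genomes $Q_3 = (+a+b+c)(+d+e+f)$ and $Q_2 = (+a-d-c-b+e-f)$ labeled in Figure 3. Their only common adjacency is between b and c, so the breakpoint graph has ten breakpoint vertices, which by our count form one non-trivial cycle on four vertices and one on six vertices:

```latex
\begin{align*}
b(Q_3, Q_2) &= 5, \qquad c(Q_3, Q_2) = 2, \qquad d(Q_3, Q_2) = 5 - 2 = 3,\\
bound(e) &= 4 \cdot d(Q_3, Q_2) - 2 \cdot b(Q_3, Q_2) = 12 - 10 = 2 .
\end{align*}
```

Informally, three 2-breaks use twelve vertex slots while only ten breakpoints are available, so at least two uses must fall on vertices that were already used on this branch.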



Figure 3 An example of four genomes with a phylogenetic tree and their multiple breakpoint
graph. (a) A phylogenetic tree with four circular genomes $P_1, P_2, P_3, P_4$ (represented as
green, blue, red, and yellow graphs, respectively) at the leaves and specified intermediate
genomes. The obverse edges are not shown. (b) The multiple breakpoint graph
$G(P_1, P_2, P_3, P_4)$ is a superposition of graphs representing genomes $P_1, P_2, P_3, P_4$.
Genomes labeled in panel (a): $P_1$ = (+a-c-b)(+d+e+f); $P_2$ = (+d+e+b+c)(+a+f);
$P_3$ = (+a-d)(-c-b+e-f); $P_4$ = (+d-a-c-b+e-f); $Q_2$ = (+a-d-c-b+e-f);
$Q_3$ = (+a+b+c)(+d+e+f).
Analyzing breakpoint reuse (simulated genomes)

We start by analyzing simulated data based on FBM with n fragile regions present in k genomes
that evolved according to a certain phylogenetic tree (for the varying parameter n). We
represent one of the leaf genomes as the genome with 20 random circular chromosomes and
simulate 100 2-breaks on each branch of the tree.

Figure 4 represents a phylogenetic tree on five leaf genomes, denoted M, R, D, Q, H, and three
ancestral genomes, denoted MR, MRD, QH. The table in Figure 5 presents the results of a single
FBM simulation and illustrates that $bound(e_1, e_2)$ provides an excellent approximation for
the inter-reuse $br(e_1, e_2)$ for all 21 pairs of branches. While bound(e) (on the diagonal of
the table in Figure 5) is somewhat less accurate, it also provides a reasonable approximation
for br(e). We remark that $bound(e_1, e_2) = br(e_1, e_2)$ if simulations produce the shortest
rearrangement scenarios on the branches $e_1$ and $e_2$. The table in Figure 5 illustrates that
this is mainly the case for our simulations.

Below we describe analytical approximations for the values in the table in Figure 5. Since
every 2-break uses four out of 2n vertices in the genome graph, a random 2-break uses a vertex
v with the probability $\frac{2}{n}$. Thus, a sequence of t random 2-breaks does not use a
vertex v with the probability $\left(1 - \frac{2}{n}\right)^t \approx e^{-2t/n}$ (for
$t \ll n$). For branches $e_1$ and $e_2$ with respectively $t_1$ and $t_2$ random 2-breaks, the
probability that a particular vertex is inter-reused on $e_1$ and $e_2$ is approximated as

$$\left(1 - e^{-2t_1/n}\right)\left(1 - e^{-2t_2/n}\right).$$

Therefore, the expected number of inter-reused vertices is approximated as

$$2n\left(1 - e^{-2t_1/n}\right)\left(1 - e^{-2t_2/n}\right).$$

Below we will compare the observed inter-reuse with the expected inter-reuse in FBM to see
whether they are similar, thus checking whether FBM represents a reasonable null hypothesis.
We will use the term scaled inter-reuse to refer to the 
observed inter-reuse divided by the expected inter-reuse. 
If FBM is an adequate null hypothesis we expect the 
scaled inter-reuse to be close to one. 
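
The expected inter-reuse under FBM is easy to evaluate numerically. The sketch below simply codes the approximation stated above; the function names are our own, and the parameter values (branches of 100 2-breaks, n = 500, 900, 1,300) are the ones used for the simulated genomes in this section.

```python
import math


def expected_inter_reuse(n, t1, t2):
    """Approximate expected number of vertices used on both branches under FBM.

    n is the number of fragile regions; t1 and t2 are the numbers of random
    2-breaks on the two branches.
    """
    return 2 * n * (1 - math.exp(-2 * t1 / n)) * (1 - math.exp(-2 * t2 / n))


def scaled_inter_reuse(observed, n, t1, t2):
    """Observed inter-reuse divided by its expectation under FBM."""
    return observed / expected_inter_reuse(n, t1, t2)


for n in (500, 900, 1300):
    print(n, round(expected_inter_reuse(n, 100, 100), 1))
```

For these parameters the approximation gives roughly 109, 71, and 53 inter-reused vertices, in line with the values of approximately 110, 70, and 50 that the text reports for simulations with a turnover rate of zero.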

Similarly, a sequence of t random 2-breaks uses a vertex v exactly once with the probability

$$t \cdot \frac{2}{n}\left(1 - \frac{2}{n}\right)^{t-1} \approx \frac{2t}{n}\,e^{-2(t-1)/n}.$$

Therefore, the probability of a particular vertex being intra-reused on a branch with t random
2-breaks is approximately

$$1 - e^{-2t/n} - \frac{2t}{n}\,e^{-2(t-1)/n},$$

implying that the expected intra-reuse is approximately

$$2n\left(1 - e^{-2t/n} - \frac{2t}{n}\,e^{-2(t-1)/n}\right).$$

We will use the term scaled intra-reuse to refer to the observed intra-reuse divided by the
expected intra-reuse. Table S1 in Additional file 1 shows the scaled intra- and inter-reuse for
21 pairs of branches (averaged over 100 simulations) and illustrates that they all are close to
one.
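
The intra-reuse approximation can be compared with a small Monte Carlo experiment. The sketch below is a simplified stand-in for the simulations described in the text, not the authors' code: it tracks only the black edges of a genome graph with n edges (the chromosome structure is ignored), applies t random 2-breaks, and counts vertices used at least twice. All names and the number of runs are assumptions of this example.

```python
import math
import random


def expected_intra_reuse(n, t):
    """Approximate expected number of vertices used at least twice on a branch."""
    p_unused = math.exp(-2 * t / n)
    p_once = (2 * t / n) * math.exp(-2 * (t - 1) / n)
    return 2 * n * (1 - p_unused - p_once)


def simulate_intra_reuse(n, t, rng):
    """Apply t random 2-breaks to n black edges; count vertices used >= 2 times."""
    partner = {}
    for i in range(n):                       # arbitrary initial black edges
        partner[2 * i], partner[2 * i + 1] = 2 * i + 1, 2 * i
    uses = [0] * (2 * n)
    for _ in range(t):
        u = rng.randrange(2 * n)             # random endpoint of a random edge
        v = partner[u]
        x = rng.randrange(2 * n)
        while x in (u, v):                   # second edge must differ from the first
            x = rng.randrange(2 * n)
        y = partner[x]
        for w in (u, v, x, y):
            uses[w] += 1
        # Replace (u, v), (x, y) with (u, x), (v, y); since x is a random
        # endpoint of its edge, both possible rewirings are equally likely.
        partner[u], partner[x] = x, u
        partner[v], partner[y] = y, v
    return sum(1 for count in uses if count >= 2)


rng = random.Random(1)
runs = [simulate_intra_reuse(900, 100, rng) for _ in range(200)]
print(sum(runs) / len(runs), expected_intra_reuse(900, 100))
```

The simulated average and the analytical value should be close; dividing one by the other gives the scaled intra-reuse for this toy setting.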

Figure 5 The number of intra- and inter-reuses between seven branches of the tree in Figure 4,
each of length 100, for simulated genomes with n fragile regions (n = 500, 900, 1,300). The
diagonal elements represent intra-reuses while the elements above the diagonal represent
inter-reuses. In each cell with numbers x : y, x represents the observed reuse while y
represents the corresponding lower bound. The cells of the table are colored red (for adjacent
branches like M+ and R+), green (for branches that are separated by a single branch like M+
and D+ separated by MR+), and yellow (for branches that are separated by two branches like M+
and H+ separated by MR+ and QH+).

We now perform a similar simulation, this time varying the number of 2-breaks on the branches
according to the branch lengths specified in Figure 4. Table S2 in Additional file 1 (similar
to Table S1 in Additional file 1) illustrates that the lower bounds also provide accurate
approximations in the case of varying branch lengths. Similar results were obtained in the case
of evolutionary trees with varying topologies (data are not shown). We therefore use only lower
bounds to generate the table in Figure 6 rather than showing both real distances and the lower
bounds as in the table in Figure 5.

In the case when the branch lengths vary, we find it convenient to represent the data in
Table S2 in Additional file 1 in a different way (as a plot) that better illustrates
variability in the scaled inter-reuse. We define the distance between branches $e_1$ and $e_2$
in the phylogenetic tree as the distance between their midpoints, that is, the overall length
of the path starting at $e_1$ and ending at $e_2$, minus half the total length of $e_1$ and
$e_2$. For example, $d(M+, H+) = 56 + 170 + 58 + 28 - \frac{56 + 28}{2} = 270$ (see Figure 4).
The x-axis in Figure S1 in Additional file 1 represents the distances between pairs of
branches (21 pairs total), while the y-axis represents the scaled inter-reuse for pairs of
branches at the distance x.
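
The midpoint distance between two branches can be computed from the branch lengths along the path that connects them. The helper below is a small illustration with a hypothetical name; the lengths in the example are those of the path from M+ to H+ quoted in the text.

```python
def branch_distance(path_lengths):
    """Midpoint-to-midpoint distance along a path of branches.

    path_lengths lists the lengths of all branches on the path, starting
    with the first branch of the pair and ending with the second one.
    """
    total = sum(path_lengths)
    return total - (path_lengths[0] + path_lengths[-1]) / 2


print(branch_distance([56, 170, 58, 28]))  # 270.0
```

For two adjacent branches the path consists of just the two branches themselves, so the distance is half the sum of their lengths.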

Surprising irregularities in breakpoint reuse in 
mammalian genomes 

The branch lengths shown in Figure 4 actually represent the approximate numbers of
rearrangements on the branches of the phylogenetic tree for Mouse, Rat, Dog, macaQue, and Human
genomes (represented in the alphabet of 433 'large' synteny blocks exceeding 500,000
nucleotides in the human genome [50]). For the mammalian genomes M, R, D, Q, and H, we first
used MGRA [50] to reconstruct genomes of their common ancestors (denoted MR, MRD, and QH in
Figure 4) and further estimated the breakpoint inter-reuse between pairs of branches of the
phylogenetic tree. The resulting table in Figure 7 reveals some striking differences from the
simulated data (Figure 6) that follow a peculiar pattern: the larger the distance between two
branches, the smaller the amount of inter-reuse between them (in contrast to RBM/FBM where the
amount of inter-reuse does not depend on the distance between branches). The statement above is
imprecise since we have not yet described how to compare the amount of inter-reuse for
different branches at various distances. However, we can already illustrate this phenomenon by
considering branches of similar length that presumably influence the inter-reuse in a similar
way (see below).

Figure 6 The estimated number of intra- and inter-reuses bound(e) and $bound(e_1, e_2)$ between
seven branches with varying branch length specified in Figure 4 (data simulated according to
FBM). The cells are colored as in Figure 5.



Figure 7 The estimated number of intra- and inter-reuses bound(e) and $bound(e_1, e_2)$ between
seven branches of the phylogenetic tree in Figure 4 of five mammalian genomes (real data). The
cells are colored as in Figure 5.

Figure 8 Subtables of Figure 6 for n = 900 (top part) and Figure 7 (bottom part) featuring
branches M+, R+, and QH+ as one element of the pair. The cells are colored as in Figure 5.

We notice that branches M+, R+, and QH+ have similar lengths (varying from 56 to 68
rearrangements) and construct subtables of Figure 6 (for n = 900) and Figure 7 with only three
rows corresponding to these branches (Figure 8). Since the lengths of branches M+, R+, and QH+
are similar, FBM implies that the elements belonging to the same column in the table in
Figure 8 should be similar. This is indeed the case for simulated data (small variations within
each column) but not for real data. In fact, maximal elements in each column for real data
exceed other elements by a factor of three to five (with the exception of the MR+ column).
Moreover, the peculiar pattern associated with these maximal elements (maximal elements
correspond to red cells) suggests that this effect is unlikely to be caused by random
variations in breakpoint reuse. We remind the reader that red cells correspond to pairs of
adjacent branches in the evolutionary tree, suggesting that breakpoint reuse is maximal between
close branches and decreases with evolutionary time. A similar pattern is observed for the
other pairs of branches of similar length: adjacent branches feature much higher inter-reuse
than distant branches. We also remark that the most distant pairs of branches (H+ and M+, H+
and R+, Q+ and M+, Q+ and R+ in the yellow cells) feature the lowest inter-reuse. The only
branch that shows relatively similar inter-reuse (varying from 58 to 80) with the branches M+,
R+, and QH+ is the branch MR+, which is adjacent to each of these branches.

Below we modify FBM to come up with a new model 
of chromosome evolution, explaining the surprising irre- 
gularities in the inter-reuse across mammalian genomes. 

Turnover fragile breakage model: birth and death of 
fragile regions 

We start with a simulation of 100 rearrangements on every branch of the tree in Figure 4.
However, instead of assuming that fragile regions are fixed, we assume that after every
rearrangement x fragile regions 'die' and x fragile regions are 'born' (keeping a constant
number of fragile regions throughout the simulation). We assume that the genome has m
potentially 'breakable' sites but only n of them are currently fragile (n ≤ m); the remaining
m - n sites are currently solid. The dying regions are randomly selected from the n currently
fragile regions, while the newly born regions are randomly selected from the m - n solid
regions. The simplest TFBM with a fixed rate of the 'birth and death' process is defined by the
parameters m, n, and the turnover rate x. FBM is a particular case of TFBM corresponding to
x = 0 and n < m, while RBM is a particular case of TFBM corresponding to x = 0 and n = m. While
this over-simplistic model with a fixed turnover rate may not adequately describe the real
rearrangement process, it allows one to analyze the general trends and to compare them to the
trends observed in real data. We further remark that the goal of this paper is to develop a
test for distinguishing between TFBM and FBM/RBM rather than a test for distinguishing between
FBM and RBM. Thus, our simulations do not distinguish between FBM (x = 0 and n < m) and RBM
(x = 0 and n = m) since they do not affect the m - n inactive breakpoints in FBM. To
distinguish FBM from RBM, one has to analyze the long cycles in the breakpoint graph and the
distribution of synteny block sizes (see [3,8]).
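
The birth and death process can be illustrated with a deliberately simplified toy model. The sketch below is not the simulation used in the paper: it ignores the genome graph and the phylogenetic tree, follows a single lineage, and treats every rearrangement as breaking two distinct currently fragile sites out of m breakable sites. After each rearrangement, x fragile sites die and x solid sites are born. It then reports how many sites two rearrangements share on average, as a function of the number of steps between them. All function and variable names, as well as the number of steps, are assumptions of this example.

```python
import random
from collections import defaultdict


def tfbm_reuse_by_lag(m, n, x, steps, max_lag, rng):
    """Average number of sites shared by two rearrangements `lag` steps apart.

    Returns a list whose entry k holds the value for lag k + 1.
    Requires x <= n and x <= m - n.
    """
    sites = list(range(m))
    rng.shuffle(sites)
    fragile, solid = sites[:n], sites[n:]
    used_at = defaultdict(list)              # site -> steps at which it was broken
    shared = [0] * (max_lag + 1)
    for t in range(steps):
        for s in rng.sample(fragile, 2):     # a rearrangement breaks two fragile sites
            for earlier in used_at[s]:
                if t - earlier <= max_lag:
                    shared[t - earlier] += 1
            used_at[s].append(t)
        # Turnover: x fragile sites die and x solid sites are born.
        dying = rng.sample(range(n), x)
        born = rng.sample(range(m - n), x)
        for i, j in zip(dying, born):
            fragile[i], solid[j] = solid[j], fragile[i]
    return [shared[lag] / (steps - lag) for lag in range(1, max_lag + 1)]


rng = random.Random(0)
for x in (0, 3):
    reuse = tfbm_reuse_by_lag(2000, 900, x, 20000, 300, rng)
    early = sum(reuse[:20]) / 20             # lags 1..20
    late = sum(reuse[-20:]) / 20             # lags 281..300
    print(x, round(early, 5), round(late, 5))
```

With x = 0 (the FBM/RBM case) the reuse does not depend on the lag, whereas with a positive turnover rate the reuse between rearrangements that are far apart drops, which is the qualitative trend the model is meant to capture.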

The leftmost subtable of Figure 9 with x = 0 represents an equivalent of the table in Figure 5
for FBM and reveals that the inter-reuse is roughly the same on all pairs of branches
(approximately 110 for n = 500, approximately 70 for n = 900, approximately 50 for n = 1,300).
The right subtables of Figure 9 represent equivalents of the leftmost subtable for TFBM with
the turnover rate x = 1, 2, 3 and reveal that the inter-reuse in yellow cells is lower than in
green cells, while the inter-reuse in green cells is lower than in red cells.

Figure 10 shows the scaled inter-reuse averaged over yellow, green, and red cells and reveals
different behavior of FBM and TFBM. Indeed, while the scaled inter-reuse is close to 1 for all
pairs of branches in the case of FBM, it varies in the case of TFBM. For example, for n = 900,
m = 2,000, and x = 3, the inter-reuse in yellow cells is approximately 40, in green cells is
approximately 45, and in red cells is approximately 56. Table S3 in Additional file 1 presents
the differences in the inter-reuse between red, green, and yellow cells as a function of m and
x (for n = 900). In Methods we describe a formula for estimating the breakpoint inter-reuse in
the case of TFBM that accurately approximates the values shown in Figure 10.

Figure 9 The breakpoint intra- and inter-reuse (averaged over 100 simulations) for five
simulated genomes M, R, D, Q, H under the TFBM model with m = 2,000 synteny blocks, n fragile
regions, the turnover rate x, and the evolutionary tree shown in Figure 4 with the length of
each branch equal to 100. The cells are colored as in Figure 5.

Table S3 in Additional file 1 demonstrates that the 
distribution of inter-reuses among green, red, and yellow 
cells differs between FBM and TFBM. We argue that 
this distribution (for example, the slope of the curve in 
Figure 10) represents yet another test to confirm or 
reject FBM/TFBM. However, while it is clear how to 
apply this test to the simulated data (with known rear- 
rangements), it remains unclear how to compute it for 
real data when the ancestral genomes (as well as the 
parameters of the model) are unknown. While the 
ancestral genomes can be reliably approximated using 
the algorithms for ancestral genome reconstruction 
[33,49-51], estimating the number of fragile regions 
remains an open problem (see [3]). Below we develop a 
new test (that does not require knowledge of the number of fragile regions n) and demonstrate
that FBM does not pass this test while TFBM does, explaining the surprisingly low inter-reuse
in mammalian genomes.

Multispecies breakpoint reuse test 

Given a phylogenetic tree describing a rearrangement scenario, we define the multispecies
breakpoint reuse on this tree as follows. For two rearrangements $\rho_1$ and $\rho_2$ in the
scenario, we define the distance $d(\rho_1, \rho_2)$ as the number of rearrangements in the
scenario between $\rho_1$ and $\rho_2$ plus one. For example, the distance between 2-breaks
$r_4$ and $r_6$ in the tree in Figure 3 is four. We define the (actual) multispecies breakpoint
reuse as a function

$$R(l) = \frac{\sum_{\rho_1, \rho_2 :\, d(\rho_1, \rho_2) = l} br(\rho_1, \rho_2)}{\sum_{\rho_1, \rho_2 :\, d(\rho_1, \rho_2) = l} 1}$$

that represents the total breakpoint reuse between pairs of rearrangements $\rho_1, \rho_2$ at
the distance l divided by the number of such pairs. Here $br(\rho_1, \rho_2)$ stands for the
number of vertices used by both 2-breaks $\rho_1$ and $\rho_2$.
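
For a fully specified scenario, R(l) can be computed by splitting every branch into sub-branches that carry one 2-break each and measuring distances in the resulting tree. The sketch below is one possible way to do this and is not taken from the paper; the input format, the function name, and the tiny example tree are hypothetical, and every branch is assumed to carry at least one 2-break.

```python
from collections import defaultdict, deque
from itertools import combinations


def multispecies_reuse(branches):
    """Return {l: R(l)} for a scenario on a tree.

    branches is a list of (parent, child, two_breaks); two_breaks lists the
    vertex sets used by the 2-breaks on the branch, ordered from parent to
    child. Each branch must carry at least one 2-break.
    """
    sub = []                                   # sub-branches: (node, node, vertices)
    for parent, child, rhos in branches:
        inner = [(parent, child, i) for i in range(1, len(rhos))]
        nodes = [parent] + inner + [child]
        sub += [(nodes[i], nodes[i + 1], set(rho)) for i, rho in enumerate(rhos)]

    neighbours = defaultdict(set)
    for a, b, _ in sub:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def hops(src):                             # breadth-first search from one node
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for w in neighbours[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return dist

    dist = {node: hops(node) for node in list(neighbours)}
    total, pairs = defaultdict(int), defaultdict(int)
    for (a, b, r1), (c, d, r2) in combinations(sub, 2):
        # rearrangements strictly between the two sub-branches, plus one
        l = 1 + min(dist[a][c], dist[a][d], dist[b][c], dist[b][d])
        total[l] += len(r1 & r2)
        pairs[l] += 1
    return {l: total[l] / pairs[l] for l in sorted(pairs)}


# Hypothetical scenario: two branches leaving the same ancestor X.
scenario = [("X", "A", [{1, 2, 3, 4}, {3, 4, 5, 6}]),
            ("X", "B", [{1, 2, 7, 8}])]
print(multispecies_reuse(scenario))  # {1: 2.0, 2: 0.0}
```

In this example the two pairs of 2-breaks at distance one share two vertices each, while the single pair at distance two shares none.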

Since the rearrangements on branches of the phylogenetic tree are unknown, we use the following
sampling procedure to approximate R(l). Given genomes P and Q, we sample various shortest
rearrangement scenarios between these genomes by generating random 2-break transformations of P
into Q. To generate a random transformation we first randomly select a non-trivial cycle C in
the breakpoint graph G(P, Q) with the probability proportional to $|C|/2 - 1$, that is, the
number of 2-breaks required to transform such a cycle into a collection of trivial cycles (|C|
stands for the length of C). Then we uniformly randomly select a 2-break $\rho$ from the set of
all $\binom{|C|/2}{2} = \frac{|C|\,(|C| - 2)}{8}$ 2-breaks that split the selected cycle C into
two and thus by Theorem 1 decrease the distance between P and Q by one (that is,
$d(\rho \cdot P, Q) = d(P, Q) - 1$). We continue selecting non-trivial cycles and 2-breaks in
an iterative fashion for genomes $\rho \cdot P$ and Q and so on until P is transformed into Q.
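
The sampling procedure can be sketched on the breakpoint graph alone. The code below is an illustration under our own assumptions rather than the authors' implementation: black and red edges are stored as dictionaries mapping each vertex to its neighbor, a cycle is stored as the list of its vertices in traversal order, and the six-vertex example graph is hypothetical.

```python
import random


def matching(edges):
    """Turn a list of edges (u, v) into a vertex -> neighbor dictionary."""
    m = {}
    for u, v in edges:
        m[u], m[v] = v, u
    return m


def nontrivial_cycles(black, red):
    """List non-trivial alternating cycles as vertex lists [v0, w0, v1, w1, ...],
    where (v_k, w_k) are black edges and (w_k, v_{k+1}) are red edges."""
    seen, found = set(), []
    for start in black:
        if start in seen:
            continue
        cycle, v = [], start
        while v not in seen:
            w = black[v]
            seen.update((v, w))
            cycle += [v, w]
            v = red[w]
        if len(cycle) > 2:
            found.append(cycle)
    return found


def sample_shortest_scenario(black, red, rng):
    """Sample 2-breaks that each split one cycle until black equals red."""
    black = dict(black)
    scenario = []
    while True:
        cycles = nontrivial_cycles(black, red)
        if not cycles:
            return scenario
        weights = [len(c) // 2 - 1 for c in cycles]      # |C|/2 - 1
        cycle = rng.choices(cycles, weights=weights)[0]
        i, j = sorted(rng.sample(range(len(cycle) // 2), 2))
        a, b = cycle[2 * i], cycle[2 * i + 1]            # first black edge
        c, d = cycle[2 * j], cycle[2 * j + 1]            # second black edge
        # Replacing (a, b), (c, d) with (b, c), (a, d) splits the cycle in two.
        black[b], black[c] = c, b
        black[a], black[d] = d, a
        scenario.append(((a, b), (c, d)))


# Hypothetical breakpoint graph: a single cycle on six vertices (d = 3 - 1 = 2).
black = matching([(0, 1), (2, 3), (4, 5)])
red = matching([(1, 2), (3, 4), (5, 0)])
print(sample_shortest_scenario(black, red, random.Random(0)))
```

Every sampled scenario for this example consists of exactly two 2-breaks, matching the distance given by Theorem 1, while the particular 2-breaks vary from sample to sample.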

The described sampling can be performed for every branch e = (P, Q) of the phylogenetic tree,
essentially partitioning e into length(e) = d(P, Q) sub-branches, each featuring a single
2-break. The resulting tree will have $\sum_e length(e)$ sub-branches, where the sum is taken
over all branches e.

For each pair of sub-branches, we compute the number of reused vertices across them and
accumulate these numbers according to the distance between these sub-branches in the tree. The
empirical multispecies breakpoint reuse (the average reuse between all sub-branches at the
distance l) is defined as the actual multispecies breakpoint reuse in a sampled rearrangement
scenario. Figure S2 in Additional file 1 represents this function for five simulated genomes on
m = 2,000 synteny blocks, n = 900 fragile regions, and the turnover rate x varying from zero to
four, with the same phylogenetic tree and distances between the genomes (averaged over 100
random samplings; while individual samplings produce varying results, we found that the
variance of the R(l) estimates across various samplings is rather small). Figure S3 in
Additional file 1 demonstrates that our sampling procedure, while imperfect, accurately
estimates the theoretical R(l) curve (see [52] for other approaches to sampling rearrangement
scenarios). Similar tests on phylogenetic trees with varying topologies demonstrated a good fit
between actual, empirical, and theoretical R(l) curves (data are not shown).

For the five mammalian genomes, the plot of R(l) is shown in Figure 11. From this empirical
curve we estimated the parameters n ≈ 196, x ≈ 1.12, and m ≈ 4,017 (see Methods) and displayed
the corresponding theoretical curve. We remark that the estimated parameter n in
TFBM is expected to be larger than the observed num- 
ber of synteny blocks (since not all potentially breakable 
regions were broken in a given evolutionary scenario). 
Figure S4 in Additional file 1 represents an analog of 
Figure 11 for the same genomes in higher resolution 
and illustrates that all three parameters n, x, and m 
depend on the data resolution. 

We argue that the empirical multispecies breakpoint 
reuse curve R(l) complements the 'exponential length 
distribution' [2] and 'pairwise breakpoint reuse' [3] tests 
as the third criterion to accept/reject RBM, FBM, and 
now TFBM. One can use the parameters n and x 
(estimated from the empirical R(l) curve) to evaluate the extent 
of the 'birth and death' process and to explain why Ma 
et al. [33] found so few shared breakpoints between 
different mammalian lineages. In practice, the 
'multispecies breakpoint reuse test' can be applied in the same way 
as the Nadeau-Taylor 'exponential length distribution test' 
was applied in numerous papers. The Nadeau-Taylor test 
typically amounted to constructing a histogram of synteny 
block sizes and evaluating (often visually) whether it fits the 
exponential distribution. Similarly, the 'multispecies 
breakpoint reuse test' amounts to constructing the R(l) curve and 
evaluating whether it significantly deviates from the 
horizontal line suggested by RBM and FBM. The estimated 
parameters of the TFBM model (see Methods) can be used to 
quantify the extent of these deviations. 
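
To illustrate how the parameters relate to the shape of the curve, the following sketch assumes the closed form R(l) = 8(m - n)/(nm) · (1 - xm/(n(m - n)))^l + 8/m from Materials and methods. Under that assumption the intercept R(0) equals 8/n, the plateau equals 8/m, and the per-step decay factor determines x. The function names and the idea of reading the three quantities directly off a curve are assumptions made for this example; this is not the estimation procedure used in the paper.

```python
def tfbm_curve(l, m, n, x):
    """Theoretical reuse R(l) under TFBM (closed form assumed from Methods)."""
    decay = 1.0 - x * m / (n * (m - n))
    return 8.0 * (m - n) / (n * m) * decay ** l + 8.0 / m


def params_from_curve(r0, r_inf, decay):
    """Invert R(l) = (r0 - r_inf) * decay**l + r_inf for (m, n, x)."""
    m = 8.0 / r_inf                      # plateau R(infinity) = 8/m
    n = 8.0 / r0                         # intercept R(0) = 8/n
    x = (1.0 - decay) * n * (m - n) / m  # decay = 1 - x*m/(n*(m - n))
    return m, n, x


# Round trip on the simulation parameters quoted in the text.
r0 = tfbm_curve(0, 2000, 900, 1)
r1 = tfbm_curve(1, 2000, 900, 1)
r_inf = 8.0 / 2000
decay = (r1 - r_inf) / (r0 - r_inf)      # ratio of successive deviations
print(params_from_curve(r0, r_inf, decay))

# With x = 0 (FBM) the curve is the horizontal line R(l) = 8/n.
print(tfbm_curve(0, 2000, 900, 0), tfbm_curve(500, 2000, 900, 0))
```

On noisy empirical data the intercept, plateau, and decay would have to be fitted rather than read off two points, so the inversion above only shows which feature of the curve carries information about which parameter.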

TFBM also raises an intriguing question of what 
triggers the birth and death of fragile regions. As 
demonstrated by Zhao and Bourque [38], a disproportionately 
large number of rearrangements in primate lineages are 
flanked by MSDs. TFBM is consistent with the 
Zhao-Bourque hypothesis that rearrangements are triggered by 
MSDs since MSDs are also subject to the 'birth and 
death' process. Indeed, after a segmental duplication the 
pair of matching segments becomes subject to random 
mutations and the similarity between these segments 
dissolves with time (a pair of segmental duplications 
'disappears' after approximately 40 million years of evolution if 
one adopts the parameters for defining segmental 
duplications from [53]). 

The mosaic structure of segmental duplications [53] 
provides an additional explanation of how MSDs may 
promote breakpoint reuse and generate the long cycles 
typical of the breakpoint graphs of mammalian 
genomes. Future studies of the correlation between 
fragile regions and MSDs in the human genome will 
benefit from algorithms for precise detection of 
rearrangement breakpoints [54] and will be described 
elsewhere. 



Alekseyev and Pevzner Genome Biology 2010, 11:R117 
http://genomebiology.com/2010/11/11/R117 






Fragile regions in the human genome 

Imagine the following gedanken experiment: 25 million 
years ago (time of the human-macaque split) a scientist 
sequences the genome of the human-macaque ancestor 
(QH) and attempts to predict the sites of (future) 
rearrangements in the (future) human genome. The only 
other information the scientist has is the mouse, rat, 
and dog genomes. While RBM offers no clues on how 
to make such a prediction, FBM suggests that the 
scientist should use the breakpoints between one of the 
available genomes and QH as a proxy for fragile regions. For 
example, there are 552 breakpoints between the mouse 
genome (M) and QH and 34 of them were actually used 
in the human lineage, resulting in only 34/552 ≈ 6% 
accuracy in predicting future human breakpoints 
(we use synteny blocks larger than 500 K from [50]). 

TFBM suggests that the scientist should rather use the 
closest genome to QH to better predict the human 
breakpoints. That can be achieved by first 
reconstructing the common ancestor (MRD) of mouse, rat, dog, 
and the human-macaque ancestor and then using the 
breakpoints between MRD and QH as a proxy for the sites of 
rearrangements in the human lineage. 18 out of 162 
breakpoints between MRD and QH were used in the human 
lineage, resulting in 18/162 ≈ 11% accurate prediction 
of human breakpoints, nearly doubling the accuracy of 
predictions from distant genomes. 

Now imagine that the scientist somehow gained access 
to the extant macaque genome. There are 68 break- 
points between Q and QH and 10 of them were used in 
the human lineage, resulting in 10/68 ≈ 16% accurate 
prediction of human breakpoints, again improving the 
accuracy of predictions. These estimates indicate that 
TFBM can be used to improve the prediction accuracy 
of future rearrangements in various lineages and demon- 
strate that the sites of recent rearrangements in the 
human and other primate lineages represent the best 
guess for the currently active fragile regions in the 
human genome. 
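
The three prediction accuracies quoted above are simple ratios of reused breakpoints to proxy breakpoints, and can be recomputed as in the short sketch below. The counts are taken from the text; the dictionary layout and the function name are assumptions made for this example. With the counts as quoted, the last ratio evaluates to about 14.7%, slightly below the rounded figure given in the text.

```python
# (proxy breakpoints, how many of them were used in the human lineage)
proxies = {
    "M vs QH": (552, 34),
    "MRD vs QH": (162, 18),
    "Q vs QH": (68, 10),
}


def accuracy(total, reused):
    """Fraction of proxy breakpoints that were reused in the human lineage."""
    return reused / total


for name, (total, reused) in proxies.items():
    print(f"{name}: {reused}/{total} = {accuracy(total, reused):.1%}")
```

The ordering of the three ratios is the point of the comparison: the closer the proxy genome is to QH, the larger the fraction of its breakpoints that is reused in the human lineage.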

We therefore focus on the incident branches H+, Q+, 
and QH+ and construct the breakpoint graphs G(H, 
QH), G(Q, QH), and G(QH, MRD). Figure S5 in Addi- 
tional file 1 superimposes these three graphs and 
(together with Table S4 in Additional file 1) illustrates 
breakpoints that were inter-reused on the branches H+, 
Q+, and QH+. Figure 12 shows the positions of 
these recently affected breakpoints (projected to the 
human genome) that, according to TFBM, represent 
the best proxy for the currently active fragile regions 
in the human genome. Various ongoing primate gen- 
ome sequencing projects will soon result in an even 
better estimate for the fragile regions in the human 
genome. 



Conclusions 

Since every species on Earth (including Homo sapiens) 
may speciate into multiple new species, one can ask a 
question: 'How will the human genome evolve in the 
next million years?' TFBM suggests the putative sites of 
future rearrangements in the human genome. The 
answer to the question 'Where are the (future) fragile 
regions in the human genome?' may be surprisingly sim- 
ple: they are likely to be among the breakpoint regions 
that were used in various primate lineages. 

Nadeau and Taylor [2] proposed RBM based on a sin- 
gle observation: the exponential distribution of the 
human-mouse synteny block sizes. There is no doubt 
that jumping to this conclusion was not fully justified: 
there are many other models (for example, FBM) that 
lead to the same exponential distribution of the 'visible' 
synteny block sizes. Currently, there is no single piece of 
evidence that would allow one to claim that RBM is cor- 
rect and FBM is not. 

While Pevzner and Tesler [3] revealed large break- 
point reuse (supporting FBM and contradicting RBM), 
Ma et al. [33] revealed low breakpoint inter-reuse (con- 
tradicting FBM). This discovery calls for yet another 
generalization of FBM. The proposed TFBM model not 
only passes both the 'exponential length distribution' test 
(motivation for RBM) and the 'pairwise breakpoint reuse' 
test (motivation for FBM) but also explains the puzzling 
discovery of limited breakpoint inter-reuse in [33]. We 
therefore argue that TFBM is a more accurate model of 
chromosome evolution, allowing one to approximate the 
currently active fragile regions in the human genome. 

Needless to say, TFBM, similarly to RBM and FBM 
(or various models of point mutations, for example, 
Jukes-Cantor model), is a simplistic model of chromo- 
some evolution that is only an approximation of the real 
evolutionary process. Moreover, in the current paper we 
considered TFBM only for the case of 2-breaks and did 
not include other rearrangements such as transpositions. 
However, it is fair to assume that transpositions are as 
likely to happen on incident branches as on distant 
branches, implying that they cannot possibly cause the 
reduced breakpoint inter-reuse on distant branches. In 
addition to the limitations of TFBM as a model, there is 
a concern that the computation of empirical multispecies 
breakpoint reuse (which requires reconstruction of 
ancestral genomes) may be affected by errors in that 
reconstruction. While various tools for 
ancestral genome reconstruction (such as MGRA [50] 
and inferCARs [33]) were shown to be quite accurate 
(in particular, they produce nearly identical results while 
using very different algorithms), it is a challenging open 
problem to evaluate the multispecies breakpoint reuse 
without explicitly computing ancestral genomes. 
The key point of this paper is the birth and death 
process of fragile regions rather than a specific model 
aimed at estimating the hidden parameters of this 
process. TFBM is merely an initial and over-simplistic 
attempt to estimate these parameters. The parameters 
predicted by TFBM (for example, the number of active 
fragile regions) are currently difficult to verify against 
the scarce information about rearrangements in only 
seven reliably completed mammalian genomes, not 
unlike the parameters of RBM derived in 1984 when no 
high-resolution comparative mammalian genomic 
architectures were available. However, similarly to the 
comparative mapping efforts in the early 1990s that confirmed the 
Nadeau-Taylor estimates, we believe that the imminent 
sequencing of over 400 primate species will soon 
provide detailed information about chromosomal 
fragility in the human genome and will allow one to verify the 
TFBM parameters. 

Similarly to the discovery of breakpoint reuse in 2003 
[3], there is currently only indirect evidence supporting 
the birth and death of fragile regions in chromosome 
evolution. However, we hope that, similarly to FBM 
(that led to many follow-up studies supporting the 
existence of fragile regions), TFBM will trigger further 
investigations of the longevity of fragile regions. 

Materials and methods 

Computing multispecies breakpoint reuse in the TFBM 
model 

Let Fragile and Solid be the sets of n initial fragile 
regions and m - n initial solid regions, respectively. In 
TFBM, the sets Fragile and Solid change in accordance 
with the turnover rate x, that is, after every 2-break x 
fragile regions (corresponding to 2x vertices in the 
breakpoint graph) from Fragile are moved to Solid and 
vice versa. 

For a vertex in the set Fragile, we evaluate the 
probability P(l) that this vertex still belongs to Fragile after l 
2-breaks. After every 2-break, a vertex from Fragile 
moves to Solid with the probability x/n, while a vertex 
from Solid moves to Fragile with the probability x/(m - n). 
Therefore, 

$$P(l+1) = P(l)\left(1 - \frac{x}{n}\right) + \bigl(1 - P(l)\bigr)\frac{x}{m-n} = \left(1 - \frac{xm}{n(m-n)}\right)P(l) + \frac{x}{m-n}.$$

The solution to this recurrence with the initial condition 
P(0) = 1 is 

$$P(l) = \frac{m-n}{m}\left(1 - \frac{xm}{n(m-n)}\right)^{l} + \frac{n}{m}.$$

We now compute the expected reuse between 2-breaks ρ1 and ρ2 
separated by l other 2-breaks. Since every 2-break uses 
4 vertices, the probability that it uses a particular vertex 
in Fragile is 2/n. Since the 2-break ρ1 used 4 vertices, the 
expected reuse between ρ1 and ρ2 is: 

$$R(l) = 4 \cdot \frac{2}{n} \cdot P(l) = \frac{8(m-n)}{nm}\left(1 - \frac{xm}{n(m-n)}\right)^{l} + \frac{8}{m}.$$

Figure S6 in Additional file 1 demonstrates that this 
formula fits simulated data well, thus opening a possibi- 
lity to determine the parameters m, n, and x for given 
real genomes. 
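
As a quick numerical check of the derivation, the sketch below iterates the recurrence P(l+1) = P(l)(1 - x/n) + (1 - P(l)) x/(m - n) from P(0) = 1 and compares it with the closed form P(l) = ((m - n)/m)(1 - xm/(n(m - n)))^l + n/m, then evaluates R(l) = 8P(l)/n. The function names are illustrative assumptions; the parameter values are the simulation parameters quoted elsewhere in the text.

```python
def p_recurrence(l, m, n, x):
    """P(l) obtained by iterating the one-step recurrence from P(0) = 1."""
    p = 1.0
    for _ in range(l):
        p = p * (1.0 - x / n) + (1.0 - p) * x / (m - n)
    return p


def p_closed(l, m, n, x):
    """Closed-form P(l): probability that a fragile vertex is still fragile."""
    decay = 1.0 - x * m / (n * (m - n))
    return (m - n) / m * decay ** l + n / m


def reuse(l, m, n, x):
    """Expected reuse R(l) = 4 * (2/n) * P(l)."""
    return 8.0 / n * p_closed(l, m, n, x)


m, n, x = 2000, 900, 1
for l in (0, 10, 100, 1000):
    print(l, p_recurrence(l, m, n, x), p_closed(l, m, n, x), reuse(l, m, n, x))
```

The two versions of P(l) agree up to floating-point error, and R(l) moves from 8/n at l = 0 towards the plateau 8/m as l grows.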

We remark that if ..r^J 1 ^ <k 1 is approximated by a 
The difference between empirical and theoretical 
estimates for R(l) 

Figure S3 in Additional file 1 illustrates the results of 
simulating 400 2-breaks according to TFBM with 
parameters m = 2,000, n = 900, x = 1. As expected, the 
theoretical curve and the curve derived from simulated 
data (without sampling of various rearrangement 
scenarios) are nearly identical. We now assume that only five 
out of 401 simulated genomes are available (after 0, 100, 
200, 300, and 400 rearrangements) and use sampling of 
rearrangement scenarios to compute the empirical R(l) 
(Figure S3 in Additional file 1). One can see that the 
empirical R(l) differs from the theoretical R(l), particularly for 
small l. To understand why the empirical curve (obtained 
via sampling of rearrangement scenarios) differs from 
the theoretical curve, one has to realize that the 
multispecies breakpoint reuse test requires multiple genomes 
to reveal the 'birth and death' of fragile regions. Indeed, it 
is impossible to detect this process from only two 
genomes: for example, sampling of rearrangement scenarios 
on a single branch (simulated with TFBM with the 
parameters described above) produces a nearly horizontal 
curve R(l) ≈ 0.0083 with the TFBM signal lost. The green 
curve follows the same horizontal trend for small l (for 
example, l < 100), which typically represent pairs of 2-breaks 
on the same branch. However, for distances larger than 
the shortest branches, the theoretical curve approximates 
the empirical R(l) curve well. The absence of this 'horizontal 
trend' in Figure 11 is most likely explained by 
the fact that the H+ and Q+ branches in the corresponding 
phylogenetic tree are rather short, thus masking this 
effect. 
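
The comparison between the theoretical curve and a curve derived directly from simulated 2-breaks (without sampling of scenarios) can be reproduced in miniature with the sketch below. It is a deliberately simplified stand-in for the simulation described above, under several assumptions: genome structure is ignored and only the identities of the fragile regions broken by each 2-break are tracked; each 2-break picks two distinct fragile regions (two vertices each); x is an integer; and the distance k counts the turnover steps between two 2-breaks, so the simulated value at k is compared with 8P(k)/n. All names are hypothetical.

```python
import random


def simulate_reuse(m, n, x, steps, max_dist, seed=0):
    """Average vertex reuse between simulated 2-breaks that are k steps apart."""
    rng = random.Random(seed)
    fragile = list(range(n))       # region ids that are currently fragile
    solid = list(range(n, m))      # region ids that are currently solid
    history = []                   # the pair of regions broken by each 2-break
    for _ in range(steps):
        history.append(frozenset(rng.sample(fragile, 2)))
        # Turnover: swap x random fragile regions with x random solid ones.
        fi = rng.sample(range(len(fragile)), x)
        si = rng.sample(range(len(solid)), x)
        for a, b in zip(fi, si):
            fragile[a], solid[b] = solid[b], fragile[a]
    curve = []
    for k in range(1, max_dist + 1):
        pairs = steps - k
        shared = sum(len(history[i] & history[i + k]) for i in range(pairs))
        curve.append(2.0 * shared / pairs)   # each region carries 2 vertices
    return curve


def theory(k, m, n, x):
    """Theoretical reuse 8 * P(k) / n after k turnover steps."""
    decay = 1.0 - x * m / (n * (m - n))
    return 8.0 / n * ((m - n) / m * decay ** k + n / m)


# Small parameters keep the run fast and the averages stable.
sim = simulate_reuse(m=40, n=10, x=1, steps=3000, max_dist=30, seed=1)
for k in (1, 5, 30):
    print(k, round(sim[k - 1], 3), round(theory(k, 40, 10, 1), 3))
```

Small values of m and n are used here only so that a single run gives stable averages; with m = 2,000 and n = 900 the expected reuse per pair is below 0.01, and many more 2-breaks or repeated runs would be needed to see the curve clearly.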

Additional material 



Additional file 1: Supplementary tables and figures. Additional file 1 
contains supplementary Tables S1, S2, S3, S4 and Figures S1, S2, S3, S4, 
S5, S6. 
FBM: fragile breakage model; MSDs: matching segmental duplications; RBM: 
random breakage model; TFBM: turnover fragile breakage model. 